You are given a set of data on housing sale prices for the last few years in King County (near Seattle) between May 2014 and May 2015.
We want you to build an explanatory model for the
price of housing in King County, i.e. an interpretable
model in which the included variables are statistically justifiable.
The variable definitions are:
id - Unique ID for each home sold
date - Date of the home sale
price - Price of each home sold
bedrooms - Number of bedrooms
bathrooms - Number of bathrooms, where .5 accounts for a
room with a toilet but no shower
sqft_living - Square footage of the apartments interior
living space
sqft_lot - Square footage of the land space
floors - Number of floors
waterfront - A dummy variable for whether the apartment was
overlooking the waterfront or not
view - An index from 0 to 4 of how good the view of the
property was
condition - An index from 1 to 5 on the condition of the
apartment
grade - An index from 1 to 13, where 1-3 falls short of
building construction and design, 7 has an average level of construction
and design, and 11-13 have a high quality level of construction and
design
sqft_above - The square footage of the interior housing
space that is above ground level
sqft_basement - The square footage of the interior housing
space that is below ground level
yr_built - The year the house was initially built
yr_renovated - The year of the house’s last
renovation
zipcode - What zipcode area the house is in
lat - Lattitude
long - Longitude
sqft_living15 - The square footage of interior housing
living space for the nearest 15 neighbors
sqft_lot15 - The square footage of the land lots of the
nearest 15 neighbors
Tidy up the data ready for regression:
* You might like to think about removing some or all of `date`, `id`, `sqft_living15`, `sqft_lot15` and `zipcode` (`lat` and `long` provide a better measure of location in any event).
* Have a think about how to treat `waterfront`. Should we convert its type?
* We converted `yr_renovated` into a `renovated` logical variable, indicating whether the property had ever been renovated. You may wish to do the same.
* Have a think about how to treat `view`, `condition` and `grade`? Are they interval or categorical ordinal data types?
library(GGally)
library(modelr)
library(tidyverse)
houses <- read_csv("data/kc_house_data.csv")
## Rows: 21613 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): id
## dbl (19): price, bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterf...
## dttm (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(houses)
## Rows: 21,613
## Columns: 21
## $ id <chr> "7129300520", "6414100192", "5631500400", "2487200875", …
## $ date <dttm> 2014-10-13, 2014-12-09, 2015-02-25, 2014-12-09, 2015-02…
## $ price <dbl> 221900, 538000, 180000, 604000, 510000, 1225000, 257500,…
## $ bedrooms <dbl> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2,…
## $ bathrooms <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2.…
## $ sqft_living <dbl> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 189…
## $ sqft_lot <dbl> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470,…
## $ floors <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1…
## $ waterfront <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ view <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0,…
## $ condition <dbl> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4, 4,…
## $ grade <dbl> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7…
## $ sqft_above <dbl> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, 189…
## $ sqft_basement <dbl> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, 0, 0, …
## $ yr_built <dbl> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 20…
## $ yr_renovated <dbl> 0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ zipcode <dbl> 98178, 98125, 98028, 98136, 98074, 98053, 98003, 98198, …
## $ lat <dbl> 47.5112, 47.7210, 47.7379, 47.5208, 47.6168, 47.6561, 47…
## $ long <dbl> -122.257, -122.319, -122.233, -122.393, -122.045, -122.0…
## $ sqft_living15 <dbl> 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, 1780, 23…
## $ sqft_lot15 <dbl> 5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711, 8113, …
# tidy up data. In particular treat condition and grade as factor, as they are
# ordinal categorical
houses_tidy <- houses %>%
select(-c("id", "date", "sqft_living15", "sqft_lot15", "zipcode")) %>%
mutate(waterfront = as.logical(waterfront)) %>%
mutate(renovated = yr_renovated != 0) %>%
select(-"yr_renovated") %>%
mutate(view = as_factor(view)) %>%
mutate(condition = as_factor(condition)) %>%
mutate(grade = as_factor(grade))
glimpse(houses_tidy)
## Rows: 21,613
## Columns: 16
## $ price <dbl> 221900, 538000, 180000, 604000, 510000, 1225000, 257500,…
## $ bedrooms <dbl> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2,…
## $ bathrooms <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2.…
## $ sqft_living <dbl> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 189…
## $ sqft_lot <dbl> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470,…
## $ floors <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1…
## $ waterfront <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
## $ view <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0,…
## $ condition <fct> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4, 4,…
## $ grade <fct> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7…
## $ sqft_above <dbl> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, 189…
## $ sqft_basement <dbl> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, 0, 0, …
## $ yr_built <dbl> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 20…
## $ lat <dbl> 47.5112, 47.7210, 47.7379, 47.5208, 47.6168, 47.6561, 47…
## $ long <dbl> -122.257, -122.319, -122.233, -122.393, -122.045, -122.0…
## $ renovated <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
Check for aliased variables using the alias() function
(this takes in a formula object and a data set). [Hint
- formula price ~ . says ‘price varying with all
predictors’, this is a suitable input to alias()]. Remove
variables that lead to an alias. Check the ‘Elements of multiple
regression’ lesson for a dropdown containing further information on
finding aliased variables in a dataset.
Solution
# Alias is useful to check if we have aliased variables, i.e. one or more
# variables that can be computed from other variables
alias(price ~ ., data = houses_tidy)
## Model :
## price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors +
## waterfront + view + condition + grade + sqft_above + sqft_basement +
## yr_built + lat + long + renovated
##
## Complete :
## (Intercept) bedrooms bathrooms sqft_living sqft_lot floors
## sqft_basement 0 0 0 1 0 0
## waterfrontTRUE view1 view2 view3 view4 condition2 condition3
## sqft_basement 0 0 0 0 0 0 0
## condition4 condition5 grade3 grade4 grade5 grade6 grade7 grade8
## sqft_basement 0 0 0 0 0 0 0 0
## grade9 grade10 grade11 grade12 grade13 sqft_above yr_built lat
## sqft_basement 0 0 0 0 0 -1 0 0
## long renovatedTRUE
## sqft_basement 0 0
# seems that sqft_basement can be computed from sqft_living - sqft_above.
# let's drop sqft_living leaving just the two contributions sqft_basement and
# sqft_above
houses_tidy <- houses_tidy %>%
select(-"sqft_living")
glimpse(houses_tidy)
## Rows: 21,613
## Columns: 15
## $ price <dbl> 221900, 538000, 180000, 604000, 510000, 1225000, 257500,…
## $ bedrooms <dbl> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2,…
## $ bathrooms <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2.…
## $ sqft_lot <dbl> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470,…
## $ floors <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1…
## $ waterfront <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
## $ view <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0,…
## $ condition <fct> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4, 4,…
## $ grade <fct> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7…
## $ sqft_above <dbl> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, 189…
## $ sqft_basement <dbl> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, 0, 0, …
## $ yr_built <dbl> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 20…
## $ lat <dbl> 47.5112, 47.7210, 47.7379, 47.5208, 47.6168, 47.6561, 47…
## $ long <dbl> -122.257, -122.319, -122.233, -122.393, -122.045, -122.0…
## $ renovated <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
Systematically build a regression model containing up to four main effects (remember, a main effect is just a single predictor with coefficient), testing the regression diagnostics as you go
* splitting datasets into numeric and non-numeric columns might help `ggpairs()` run in manageable time, although you will need to add either a `price` or `resid` column to the non-numeric dataframe in order to see its correlations with the non-numeric predictors. You can do it in the following way:
houses_tidy_numeric <- houses_tidy %>%
select_if(is.numeric)
houses_tidy_nonnumeric <- houses_tidy %>%
select_if(function(x) !is.numeric(x))
houses_tidy_nonnumeric$price <- houses_tidy$price
Solution
ggpairs(houses_tidy_numeric)
ggpairs(houses_tidy_nonnumeric)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
You can also subset your ggpairs plot doing the following:
ggpairs(houses_tidy_numeric, columns = c(1, 2, 3, 4))
and the same in subsequent rounds of predictor selection with the
resid column.
Remember, if you are not sure whether including a categorical
predictor is statistically justified, run an anova() test
passing in the models with- and without the categorical predictor and
check the p-value of the test.
houses_tidy_numeric <- houses_tidy %>%
select_if(is.numeric)
houses_tidy_nonnumeric <- houses_tidy %>%
select_if(function(x) !is.numeric(x))
houses_tidy_nonnumeric$price <- houses_tidy$price
ggpairs(houses_tidy_numeric)
ggpairs(houses_tidy_nonnumeric)
Correlation of sqft_above with price looks
pretty promising, but split of price by grade
and waterfront also look decent.
houses_tidy %>%
ggplot(aes(x = grade, y = price)) +
geom_boxplot()
houses_tidy %>%
ggplot(aes(x = waterfront, y = price)) +
geom_boxplot()
Start building the models:
# model with price and sqft above
mod1a <- lm(price ~ sqft_above, data = houses_tidy)
summary(mod1a)
##
## Call:
## lm(formula = price ~ sqft_above, data = houses_tidy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -913132 -165624 -41468 109327 5339232
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59953.2 4729.8 12.68 <2e-16 ***
## sqft_above 268.5 2.4 111.87 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 292200 on 21611 degrees of freedom
## Multiple R-squared: 0.3667, Adjusted R-squared: 0.3667
## F-statistic: 1.251e+04 on 1 and 21611 DF, p-value: < 2.2e-16
mod1b <- lm(price ~ grade, data = houses_tidy)
summary(mod1b)
##
## Call:
## lm(formula = price ~ grade, data = houses_tidy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1929615 -135853 -35090 89080 5565658
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 142000 254499 0.558 0.576878
## grade3 63667 293870 0.217 0.828484
## grade4 72381 258849 0.280 0.779767
## grade5 106524 255024 0.418 0.676169
## grade6 159920 254561 0.628 0.529868
## grade7 260590 254513 1.024 0.305904
## grade8 400853 254520 1.575 0.115285
## grade9 631513 254547 2.481 0.013112 *
## grade10 929771 254611 3.652 0.000261 ***
## grade11 1354842 254817 5.317 1.07e-07 ***
## grade12 2049222 255909 8.008 1.23e-15 ***
## grade13 3567615 264106 13.508 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 254500 on 21601 degrees of freedom
## Multiple R-squared: 0.5197, Adjusted R-squared: 0.5195
## F-statistic: 2125 on 11 and 21601 DF, p-value: < 2.2e-16
mod1c <- lm(price ~ waterfront, data = houses_tidy)
summary(mod1c)
##
## Call:
## lm(formula = price ~ waterfront, data = houses_tidy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1376876 -211564 -81564 108436 7168436
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 531564 2416 220.00 <2e-16 ***
## waterfrontTRUE 1130312 27822 40.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 353900 on 21611 degrees of freedom
## Multiple R-squared: 0.07095, Adjusted R-squared: 0.07091
## F-statistic: 1650 on 1 and 21611 DF, p-value: < 2.2e-16
Grade looks the most promising, but some of the grade level coefficients are insignificant.
From the F-test at the bottom of the regression output tests against the null model (i.e. intercept only). But, if we want, we can replicate this using a separate anova. For this, our null model would be to regress price on intercept only.
You can do this in the following way:
null_model <- lm(price ~ 1, data = houses_tidy)
grade_model <- lm(price ~ grade, data = houses_tidy)
anova(null_model, grade_model)
# grade is significant, let's keep it. Now plot diagnostics
par(mfrow = c(2, 2))
plot(mod1b)
## Warning: not plotting observations with leverage one:
## 19453
houses_resid <- houses_tidy %>%
add_residuals(mod1b) %>%
select(-c("price", "grade"))
houses_resid_numeric <- houses_resid %>%
select_if(is.numeric)
houses_resid_nonnumeric <- houses_resid %>%
select_if(function(x) !is.numeric(x))
houses_resid_nonnumeric$resid <- houses_resid$resid
ggpairs(houses_resid_numeric)
ggpairs(houses_resid_nonnumeric)
Our predictor lat has highest correlation with
residuals, but waterfront and view look pretty
promising. Try all three…
mod2a <- lm(price ~ grade + lat, data = houses_tidy)
summary(mod2a)
##
## Call:
## lm(formula = price ~ grade + lat, data = houses_tidy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1863783 -112379 -28181 67454 5533910
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -29949769 610666 -49.044 < 2e-16 ***
## grade3 187923 276126 0.681 0.496151
## grade4 95367 243212 0.392 0.694978
## grade5 126354 239618 0.527 0.597980
## grade6 158453 239183 0.662 0.507673
## grade7 246275 239138 1.030 0.303093
## grade8 378635 239144 1.583 0.113369
## grade9 603010 239170 2.521 0.011701 *
## grade10 891752 239230 3.728 0.000194 ***
## grade11 1311124 239425 5.476 4.4e-08 ***
## grade12 2007093 240450 8.347 < 2e-16 ***
## grade13 3502099 248154 14.113 < 2e-16 ***
## lat 633100 11822 53.554 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 239100 on 21600 degrees of freedom
## Multiple R-squared: 0.576, Adjusted R-squared: 0.5758
## F-statistic: 2445 on 12 and 21600 DF, p-value: < 2.2e-16
mod2b <- lm(price ~ grade + waterfront, data = houses_tidy)
summary(mod2b)
##
## Call:
## lm(formula = price ~ grade + waterfront, data = houses_tidy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1929615 -132969 -32393 91481 4778940
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 142000 244366 0.581 0.5612
## grade3 63667 282169 0.226 0.8215
## grade4 72381 248543 0.291 0.7709
## grade5 92834 244870 0.379 0.7046
## grade6 155043 244426 0.634 0.5259
## grade7 258469 244379 1.058 0.2902
## grade8 395393 244386 1.618 0.1057
## grade9 623595 244413 2.551 0.0107 *
## grade10 909321 244474 3.719 0.0002 ***
## grade11 1313326 244674 5.368 8.06e-08 ***
## grade12 1947993 245731 7.927 2.35e-15 ***
## grade13 3567615 253590 14.068 < 2e-16 ***
## waterfrontTRUE 828234 19363 42.773 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 244400 on 21600 degrees of freedom
## Multiple R-squared: 0.5572, Adjusted R-squared: 0.557
## F-statistic: 2265 on 12 and 21600 DF, p-value: < 2.2e-16
mod2c <- lm(price ~ grade + view, data = houses_tidy)
summary(mod2c)
##
## Call:
## lm(formula = price ~ grade + view, data = houses_tidy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1724708 -128875 -30577 91175 5534514
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 142000 240508 0.590 0.554918
## grade3 63667 277715 0.229 0.818675
## grade4 53178 244622 0.217 0.827909
## grade5 90541 241005 0.376 0.707157
## grade6 150434 240568 0.625 0.531762
## grade7 250923 240522 1.043 0.296846
## grade8 376825 240529 1.567 0.117211
## grade9 588427 240556 2.446 0.014449 *
## grade10 861368 240619 3.580 0.000345 ***
## grade11 1246776 240821 5.177 2.27e-07 ***
## grade12 1853274 241873 7.662 1.90e-14 ***
## grade13 3362708 249623 13.471 < 2e-16 ***
## view1 208846 13343 15.652 < 2e-16 ***
## view2 139209 8023 17.351 < 2e-16 ***
## view3 228007 10937 20.847 < 2e-16 ***
## view4 596719 13855 43.070 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 240500 on 21597 degrees of freedom
## Multiple R-squared: 0.5711, Adjusted R-squared: 0.5708
## F-statistic: 1917 on 15 and 21597 DF, p-value: < 2.2e-16
# lat is significant and has a slightly higher r^2, let's keep model2a
par(mfrow = c(2, 2))
plot(mod2a)
## Warning: not plotting observations with leverage one:
## 19453
houses_resid <- houses_tidy %>%
add_residuals(mod2a) %>%
select(-c("price", "grade", "lat"))
houses_resid_numeric <- houses_resid %>%
select_if(is.numeric)
houses_resid_nonnumeric <- houses_resid %>%
select_if(function(x) !is.numeric(x))
houses_resid_nonnumeric$resid <- houses_resid$resid
ggpairs(houses_resid_numeric)
ggpairs(houses_resid_nonnumeric)
Now sqft_basement has strongest correlation with
residuals, but also compare against model with waterfront
and view.
mod3a <- lm(price ~ grade + lat + sqft_basement, data = houses_tidy)
summary(mod3a)
##
## Call:
## lm(formula = price ~ grade + lat + sqft_basement, data = houses_tidy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1642900 -110556 -22563 70920 5247787
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.775e+07 5.872e+05 -47.259 < 2e-16 ***
## grade3 1.788e+05 2.645e+05 0.676 0.499028
## grade4 9.259e+04 2.330e+05 0.397 0.691090
## grade5 1.188e+05 2.296e+05 0.517 0.604927
## grade6 1.391e+05 2.291e+05 0.607 0.543956
## grade7 2.028e+05 2.291e+05 0.885 0.376173
## grade8 3.299e+05 2.291e+05 1.440 0.149897
## grade9 5.553e+05 2.291e+05 2.424 0.015379 *
## grade10 8.298e+05 2.292e+05 3.620 0.000295 ***
## grade11 1.228e+06 2.294e+05 5.355 8.66e-08 ***
## grade12 1.880e+06 2.304e+05 8.159 3.55e-16 ***
## grade13 3.281e+06 2.378e+05 13.799 < 2e-16 ***
## lat 5.868e+05 1.137e+04 51.589 < 2e-16 ***
## sqft_basement 1.587e+02 3.607e+00 43.992 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 229100 on 21599 degrees of freedom
## Multiple R-squared: 0.6109, Adjusted R-squared: 0.6106
## F-statistic: 2608 on 13 and 21599 DF, p-value: < 2.2e-16
mod3b <- lm(price ~ grade + lat + waterfront, data = houses_tidy)
summary(mod3b)
##
## Call:
## lm(formula = price ~ grade + lat + waterfront, data = houses_tidy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1862555 -109244 -24781 70145 4724767
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -30511083 581588 -52.462 < 2e-16 ***
## grade3 190241 262923 0.724 0.469343
## grade4 95796 231583 0.414 0.679130
## grade5 112654 228161 0.494 0.621488
## grade6 153414 227746 0.674 0.500562
## grade7 243828 227703 1.071 0.284264
## grade8 372609 227709 1.636 0.101783
## grade9 594340 227734 2.610 0.009066 **
## grade10 870026 227792 3.819 0.000134 ***
## grade11 1267641 227979 5.560 2.72e-08 ***
## grade12 1902270 228964 8.308 < 2e-16 ***
## grade13 3500877 236288 14.816 < 2e-16 ***
## lat 644910 11259 57.278 < 2e-16 ***
## waterfrontTRUE 851219 18046 47.168 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 227700 on 21599 degrees of freedom
## Multiple R-squared: 0.6156, Adjusted R-squared: 0.6154
## F-statistic: 2661 on 13 and 21599 DF, p-value: < 2.2e-16
mod3c <- lm(price ~ grade + lat + view, data = houses_tidy)
summary(mod3c)
##
## Call:
## lm(formula = price ~ grade + lat + view, data = houses_tidy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1657374 -105914 -21840 69071 5500673
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -30312052 571751 -53.016 < 2e-16 ***
## grade3 189419 258409 0.733 0.463554
## grade4 76878 227609 0.338 0.735544
## grade5 110481 224243 0.493 0.622243
## grade6 148798 223836 0.665 0.506208
## grade7 236266 223794 1.056 0.291103
## grade8 353834 223801 1.581 0.113887
## grade9 558576 223826 2.496 0.012583 *
## grade10 821334 223885 3.669 0.000245 ***
## grade11 1200200 224073 5.356 8.58e-08 ***
## grade12 1807909 225052 8.033 9.97e-16 ***
## grade13 3292412 232265 14.175 < 2e-16 ***
## lat 640722 11069 57.882 < 2e-16 ***
## view1 200472 12416 16.147 < 2e-16 ***
## view2 143253 7465 19.189 < 2e-16 ***
## view3 245505 10181 24.114 < 2e-16 ***
## view4 598302 12891 46.412 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 223800 on 21596 degrees of freedom
## Multiple R-squared: 0.6287, Adjusted R-squared: 0.6285
## F-statistic: 2286 on 16 and 21596 DF, p-value: < 2.2e-16
# view model is best, keep mod3c
par(mfrow = c(2, 2))
plot(mod3c)
## Warning: not plotting observations with leverage one:
## 19453
houses_resid <- houses_tidy %>%
add_residuals(mod3c) %>%
select(-c("price", "grade", "lat", "view"))
houses_resid_numeric <- houses_resid %>%
select_if(is.numeric)
houses_resid_nonnumeric <- houses_resid %>%
select_if(function(x) !is.numeric(x))
houses_resid_nonnumeric$resid <- houses_resid$resid
ggpairs(houses_resid_numeric)
ggpairs(houses_resid_nonnumeric)
sqft_basement has highest correlation with residuals.
Let’s test against all remaining categorical predictors:
mod4a <- lm(price ~ grade + lat + view + sqft_basement, data = houses_tidy)
summary(mod4a)
##
## Call:
## lm(formula = price ~ grade + lat + view + sqft_basement, data = houses_tidy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1721908 -104563 -19475 70497 5300373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.856e+07 5.596e+05 -51.038 < 2e-16 ***
## grade3 1.822e+05 2.518e+05 0.723 0.469421
## grade4 7.896e+04 2.218e+05 0.356 0.721892
## grade5 1.072e+05 2.185e+05 0.490 0.623801
## grade6 1.359e+05 2.182e+05 0.623 0.533255
## grade7 2.054e+05 2.181e+05 0.942 0.346393
## grade8 3.218e+05 2.181e+05 1.475 0.140136
## grade9 5.307e+05 2.181e+05 2.433 0.014993 *
## grade10 7.864e+05 2.182e+05 3.604 0.000314 ***
## grade11 1.157e+06 2.184e+05 5.296 1.19e-07 ***
## grade12 1.739e+06 2.193e+05 7.930 2.30e-15 ***
## grade13 3.157e+06 2.264e+05 13.943 < 2e-16 ***
## lat 6.039e+05 1.084e+04 55.695 < 2e-16 ***
## view1 1.566e+05 1.217e+04 12.870 < 2e-16 ***
## view2 1.079e+05 7.351e+03 14.677 < 2e-16 ***
## view3 1.898e+05 1.006e+04 18.869 < 2e-16 ***
## view4 5.349e+05 1.270e+04 42.104 < 2e-16 ***
## sqft_basement 1.202e+02 3.557e+00 33.787 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 218100 on 21595 degrees of freedom
## Multiple R-squared: 0.6474, Adjusted R-squared: 0.6471
## F-statistic: 2332 on 17 and 21595 DF, p-value: < 2.2e-16
mod4b <- lm(price ~ grade + lat + view + waterfront, data = houses_tidy)
summary(mod4b)
##
## Call:
## lm(formula = price ~ grade + lat + view + waterfront, data = houses_tidy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1711573 -105554 -21218 69955 4959192
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -30635387 563469 -54.369 < 2e-16 ***
## grade3 190754 254602 0.749 0.453729
## grade4 77432 224255 0.345 0.729882
## grade5 105683 220940 0.478 0.632416
## grade6 147178 220538 0.667 0.504551
## grade7 235834 220496 1.070 0.284830
## grade8 353620 220503 1.604 0.108796
## grade9 560970 220528 2.544 0.010974 *
## grade10 822756 220586 3.730 0.000192 ***
## grade11 1195804 220772 5.416 6.14e-08 ***
## grade12 1800332 221736 8.119 4.94e-16 ***
## grade13 3349896 228854 14.638 < 2e-16 ***
## lat 647525 10910 59.354 < 2e-16 ***
## view1 198491 12233 16.226 < 2e-16 ***
## view2 138307 7358 18.797 < 2e-16 ***
## view3 224548 10064 22.311 < 2e-16 ***
## view4 365071 15646 23.334 < 2e-16 ***
## waterfrontTRUE 550012 21544 25.529 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 220500 on 21595 degrees of freedom
## Multiple R-squared: 0.6396, Adjusted R-squared: 0.6393
## F-statistic: 2254 on 17 and 21595 DF, p-value: < 2.2e-16
mod4c <- lm(price ~ grade + lat + view + condition, data = houses_tidy)
summary(mod4c)
##
## Call:
## lm(formula = price ~ grade + lat + view + condition, data = houses_tidy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1653876 -105002 -18772 71134 5519169
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -30235579 562306 -53.771 < 2e-16 ***
## grade3 131612 257186 0.512 0.608841
## grade4 46792 227085 0.206 0.836750
## grade5 60902 223799 0.272 0.785528
## grade6 102319 223641 0.458 0.647306
## grade7 195346 223637 0.873 0.382404
## grade8 323199 223648 1.445 0.148438
## grade9 534908 223677 2.391 0.016791 *
## grade10 800844 223736 3.579 0.000345 ***
## grade11 1184526 223920 5.290 1.24e-07 ***
## grade12 1794154 224872 7.979 1.55e-15 ***
## grade13 3283753 231841 14.164 < 2e-16 ***
## lat 639114 10889 58.694 < 2e-16 ***
## view1 189011 12204 15.488 < 2e-16 ***
## view2 132179 7344 17.999 < 2e-16 ***
## view3 232858 10012 23.258 < 2e-16 ***
## view4 577160 12688 45.488 < 2e-16 ***
## condition2 15201 44266 0.343 0.731306
## condition3 8443 41173 0.205 0.837521
## condition4 71904 41207 1.745 0.081006 .
## condition5 148832 41433 3.592 0.000329 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 219800 on 21592 degrees of freedom
## Multiple R-squared: 0.6418, Adjusted R-squared: 0.6415
## F-statistic: 1935 on 20 and 21592 DF, p-value: < 2.2e-16
mod4d <- lm(price ~ grade + lat + view + renovated, data = houses_tidy)
summary(mod4d)
##
## Call:
## lm(formula = price ~ grade + lat + view + renovated, data = houses_tidy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1637888 -105444 -20225 70144 5353486
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -29993826 566490 -52.947 < 2e-16 ***
## grade3 188105 255935 0.735 0.462365
## grade4 72163 225430 0.320 0.748886
## grade5 107085 222097 0.482 0.629702
## grade6 141858 221693 0.640 0.522256
## grade7 231058 221652 1.042 0.297221
## grade8 348367 221658 1.572 0.116048
## grade9 552595 221684 2.493 0.012684 *
## grade10 817843 221742 3.688 0.000226 ***
## grade11 1199479 221928 5.405 6.56e-08 ***
## grade12 1811114 222898 8.125 4.70e-16 ***
## grade13 3265820 230045 14.196 < 2e-16 ***
## lat 634027 10968 57.805 < 2e-16 ***
## view1 190965 12306 15.519 < 2e-16 ***
## view2 137231 7400 18.546 < 2e-16 ***
## view3 233695 10100 23.138 < 2e-16 ***
## view4 575708 12815 44.924 < 2e-16 ***
## renovatedTRUE 154728 7545 20.506 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 221600 on 21595 degrees of freedom
## Multiple R-squared: 0.6358, Adjusted R-squared: 0.6355
## F-statistic: 2218 on 17 and 21595 DF, p-value: < 2.2e-16
# looks like model with sqft_basement is best, keep mod4a
par(mfrow = c(2, 2))
plot(mod4a)
## Warning: not plotting observations with leverage one:
## 19453
houses_resid <- houses_tidy %>%
add_residuals(mod4a) %>%
select(- price)
Our final model in terms of main effects is:
price ~ grade + lat + view + sqft_basement
Consider possible interactions between your four main effect predictors and test their effect upon \(r^2\). Choose your best candidate interaction and visualise its effect.
Calculate the relative importance of predictors from your best
\(4\)-predictor model (i.e. the model
without an interaction). Which predictor affects price most
strongly?
Solution
Now, for interactions, have six possibilities that obey principle of strong hierarchy (i.e. consider including an interaction only if its main effects are already present in the model)
mod5a <- lm(price ~ grade + lat + view + sqft_basement + grade:lat, data = houses_tidy)
summary(mod5a)
##
## Call:
## lm(formula = price ~ grade + lat + view + sqft_basement + grade:lat,
## data = houses_tidy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1614350 -100929 -21140 67905 5265776
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.573e+07 4.389e+07 -0.358 0.7200
## grade3 2.660e+06 6.068e+07 0.044 0.9650
## grade4 8.637e+06 4.759e+07 0.181 0.8560
## grade5 2.062e+06 4.422e+07 0.047 0.9628
## grade6 -3.792e+06 4.393e+07 -0.086 0.9312
## grade7 -8.844e+06 4.390e+07 -0.201 0.8403
## grade8 -1.241e+07 4.390e+07 -0.283 0.7773
## grade9 -2.805e+07 4.392e+07 -0.639 0.5230
## grade10 -4.186e+07 4.399e+07 -0.952 0.3414
## grade11 -4.276e+07 4.430e+07 -0.965 0.3345
## grade12 -1.069e+08 4.593e+07 -2.327 0.0200 *
## grade13 3.184e+06 2.441e+05 13.040 <2e-16 ***
## lat 3.340e+05 9.234e+05 0.362 0.7176
## view1 1.559e+05 1.208e+04 12.906 <2e-16 ***
## view2 1.095e+05 7.300e+03 14.999 <2e-16 ***
## view3 1.945e+05 9.992e+03 19.465 <2e-16 ***
## view4 5.322e+05 1.262e+04 42.157 <2e-16 ***
## sqft_basement 1.206e+02 3.532e+00 34.143 <2e-16 ***
## grade3:lat -5.347e+04 1.279e+06 -0.042 0.9667
## grade4:lat -1.804e+05 1.001e+06 -0.180 0.8570
## grade5:lat -4.133e+04 9.303e+05 -0.044 0.9646
## grade6:lat 8.265e+04 9.241e+05 0.089 0.9287
## grade7:lat 1.904e+05 9.235e+05 0.206 0.8366
## grade8:lat 2.679e+05 9.236e+05 0.290 0.7717
## grade9:lat 6.011e+05 9.240e+05 0.651 0.5154
## grade10:lat 8.964e+05 9.255e+05 0.969 0.3328
## grade11:lat 9.230e+05 9.320e+05 0.990 0.3220
## grade12:lat 2.283e+06 9.663e+05 2.362 0.0182 *
## grade13:lat NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 216500 on 21585 degrees of freedom
## Multiple R-squared: 0.6527, Adjusted R-squared: 0.6523
## F-statistic: 1503 on 27 and 21585 DF, p-value: < 2.2e-16
mod5b <- lm(price ~ grade + lat + view + sqft_basement + grade:view, data = houses_tidy)
summary(mod5b)
##
## Call:
## lm(formula = price ~ grade + lat + view + sqft_basement + grade:view,
## data = houses_tidy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2194474 -104235 -20452 69331 5114365
##
## Coefficients: (11 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.858e+07 5.514e+05 -51.825 < 2e-16 ***
## grade3 1.822e+05 2.480e+05 0.735 0.462370
## grade4 8.422e+04 2.188e+05 0.385 0.700344
## grade5 1.122e+05 2.152e+05 0.521 0.602262
## grade6 1.384e+05 2.148e+05 0.645 0.519252
## grade7 2.086e+05 2.148e+05 0.971 0.331361
## grade8 3.237e+05 2.148e+05 1.507 0.131773
## grade9 5.378e+05 2.148e+05 2.504 0.012290 *
## grade10 7.624e+05 2.149e+05 3.548 0.000389 ***
## grade11 1.069e+06 2.152e+05 4.969 6.77e-07 ***
## grade12 1.806e+06 2.172e+05 8.313 < 2e-16 ***
## grade13 2.763e+06 2.353e+05 11.745 < 2e-16 ***
## lat 6.042e+05 1.068e+04 56.547 < 2e-16 ***
## view1 1.034e+06 1.554e+05 6.657 2.87e-11 ***
## view2 -2.830e+05 1.568e+05 -1.805 0.071132 .
## view3 2.136e+06 1.797e+05 11.883 < 2e-16 ***
## view4 1.357e+06 1.568e+05 8.652 < 2e-16 ***
## sqft_basement 1.163e+02 3.524e+00 33.002 < 2e-16 ***
## grade3:view1 NA NA NA NA
## grade4:view1 -9.306e+05 2.213e+05 -4.205 2.62e-05 ***
## grade5:view1 -1.003e+06 2.177e+05 -4.605 4.16e-06 ***
## grade6:view1 -9.869e+05 1.650e+05 -5.979 2.28e-09 ***
## grade7:view1 -9.138e+05 1.570e+05 -5.822 5.90e-09 ***
## grade8:view1 -8.973e+05 1.567e+05 -5.725 1.05e-08 ***
## grade9:view1 -8.906e+05 1.578e+05 -5.645 1.67e-08 ***
## grade10:view1 -8.115e+05 1.611e+05 -5.037 4.78e-07 ***
## grade11:view1 -5.043e+05 1.640e+05 -3.076 0.002104 **
## grade12:view1 NA NA NA NA
## grade13:view1 NA NA NA NA
## grade3:view2 NA NA NA NA
## grade4:view2 3.451e+05 2.692e+05 1.282 0.199908
## grade5:view2 4.383e+05 1.802e+05 2.432 0.015033 *
## grade6:view2 3.968e+05 1.604e+05 2.474 0.013350 *
## grade7:view2 3.641e+05 1.575e+05 2.312 0.020806 *
## grade8:view2 3.970e+05 1.573e+05 2.524 0.011620 *
## grade9:view2 3.892e+05 1.577e+05 2.468 0.013584 *
## grade10:view2 3.867e+05 1.584e+05 2.441 0.014654 *
## grade11:view2 6.733e+05 1.599e+05 4.211 2.56e-05 ***
## grade12:view2 7.941e+04 1.709e+05 0.465 0.642233
## grade13:view2 NA NA NA NA
## grade3:view3 NA NA NA NA
## grade4:view3 NA NA NA NA
## grade5:view3 -2.041e+06 2.804e+05 -7.280 3.45e-13 ***
## grade6:view3 -1.971e+06 1.887e+05 -10.446 < 2e-16 ***
## grade7:view3 -2.027e+06 1.813e+05 -11.176 < 2e-16 ***
## grade8:view3 -1.954e+06 1.806e+05 -10.820 < 2e-16 ***
## grade9:view3 -1.940e+06 1.807e+05 -10.732 < 2e-16 ***
## grade10:view3 -1.791e+06 1.813e+05 -9.875 < 2e-16 ***
## grade11:view3 -1.870e+06 1.830e+05 -10.218 < 2e-16 ***
## grade12:view3 -2.662e+06 1.948e+05 -13.663 < 2e-16 ***
## grade13:view3 NA NA NA NA
## grade3:view4 NA NA NA NA
## grade4:view4 NA NA NA NA
## grade5:view4 -1.098e+06 1.906e+05 -5.759 8.57e-09 ***
## grade6:view4 -1.024e+06 1.687e+05 -6.069 1.31e-09 ***
## grade7:view4 -9.416e+05 1.615e+05 -5.829 5.65e-09 ***
## grade8:view4 -8.595e+05 1.588e+05 -5.411 6.33e-08 ***
## grade9:view4 -1.031e+06 1.589e+05 -6.485 9.05e-11 ***
## grade10:view4 -5.960e+05 1.592e+05 -3.742 0.000183 ***
## grade11:view4 -5.687e+05 1.616e+05 -3.520 0.000433 ***
## grade12:view4 -6.514e+05 1.667e+05 -3.909 9.31e-05 ***
## grade13:view4 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 214700 on 21562 degrees of freedom
## Multiple R-squared: 0.6587, Adjusted R-squared: 0.6579
## F-statistic: 832.1 on 50 and 21562 DF, p-value: < 2.2e-16
mod5c <- lm(price ~ grade + lat + view + sqft_basement + grade:sqft_basement, data = houses_tidy)
summary(mod5c)
##
## Call:
## lm(formula = price ~ grade + lat + view + sqft_basement + grade:sqft_basement,
## data = houses_tidy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2261338 -102336 -20466 68443 4995869
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.901e+07 5.529e+05 -52.467 < 2e-16 ***
## grade3 1.840e+05 2.486e+05 0.740 0.459091
## grade4 7.449e+04 2.191e+05 0.340 0.733841
## grade5 1.080e+05 2.157e+05 0.501 0.616656
## grade6 1.427e+05 2.153e+05 0.663 0.507414
## grade7 2.189e+05 2.153e+05 1.017 0.309241
## grade8 3.323e+05 2.153e+05 1.544 0.122673
## grade9 5.215e+05 2.153e+05 2.422 0.015444 *
## grade10 7.604e+05 2.154e+05 3.530 0.000416 ***
## grade11 1.071e+06 2.157e+05 4.965 6.91e-07 ***
## grade12 1.602e+06 2.172e+05 7.375 1.71e-13 ***
## grade13 2.022e+06 2.352e+05 8.598 < 2e-16 ***
## lat 6.133e+05 1.071e+04 57.241 < 2e-16 ***
## view1 1.530e+05 1.202e+04 12.723 < 2e-16 ***
## view2 1.062e+05 7.274e+03 14.600 < 2e-16 ***
## view3 1.722e+05 9.991e+03 17.238 < 2e-16 ***
## view4 5.077e+05 1.266e+04 40.109 < 2e-16 ***
## sqft_basement 9.238e+02 5.159e+01 17.906 < 2e-16 ***
## grade3:sqft_basement NA NA NA NA
## grade4:sqft_basement -6.202e+01 1.097e+03 -0.057 0.954902
## grade5:sqft_basement -8.016e+02 1.056e+02 -7.591 3.32e-14 ***
## grade6:sqft_basement -8.565e+02 5.464e+01 -15.675 < 2e-16 ***
## grade7:sqft_basement -8.514e+02 5.193e+01 -16.396 < 2e-16 ***
## grade8:sqft_basement -8.351e+02 5.197e+01 -16.069 < 2e-16 ***
## grade9:sqft_basement -7.699e+02 5.227e+01 -14.730 < 2e-16 ***
## grade10:sqft_basement -7.336e+02 5.271e+01 -13.918 < 2e-16 ***
## grade11:sqft_basement -6.381e+02 5.375e+01 -11.871 < 2e-16 ***
## grade12:sqft_basement -6.263e+02 5.601e+01 -11.182 < 2e-16 ***
## grade13:sqft_basement NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 215300 on 21586 degrees of freedom
## Multiple R-squared: 0.6566, Adjusted R-squared: 0.6562
## F-statistic: 1588 on 26 and 21586 DF, p-value: < 2.2e-16
mod5d <- lm(price ~ grade + lat + view + sqft_basement + lat:view, data = houses_tidy)
summary(mod5d)
##
## Call:
## lm(formula = price ~ grade + lat + view + sqft_basement + lat:view,
## data = houses_tidy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1679367 -103252 -20586 69103 5277777
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.681e+07 5.752e+05 -46.605 < 2e-16 ***
## grade3 1.749e+05 2.508e+05 0.698 0.485405
## grade4 7.832e+04 2.209e+05 0.355 0.722900
## grade5 1.085e+05 2.176e+05 0.499 0.618094
## grade6 1.376e+05 2.172e+05 0.634 0.526358
## grade7 2.073e+05 2.172e+05 0.955 0.339806
## grade8 3.231e+05 2.172e+05 1.488 0.136877
## grade9 5.318e+05 2.172e+05 2.448 0.014358 *
## grade10 7.862e+05 2.173e+05 3.619 0.000296 ***
## grade11 1.153e+06 2.174e+05 5.304 1.14e-07 ***
## grade12 1.731e+06 2.184e+05 7.925 2.40e-15 ***
## grade13 3.118e+06 2.254e+05 13.832 < 2e-16 ***
## lat 5.669e+05 1.121e+04 50.596 < 2e-16 ***
## view1 -2.281e+07 5.489e+06 -4.156 3.26e-05 ***
## view2 -1.762e+07 2.730e+06 -6.456 1.10e-10 ***
## view3 -2.067e+07 3.479e+06 -5.941 2.87e-09 ***
## view4 -5.281e+07 5.178e+06 -10.201 < 2e-16 ***
## sqft_basement 1.188e+02 3.544e+00 33.510 < 2e-16 ***
## lat:view1 4.828e+05 1.154e+05 4.185 2.87e-05 ***
## lat:view2 3.728e+05 5.739e+04 6.496 8.44e-11 ***
## lat:view3 4.387e+05 7.317e+04 5.996 2.06e-09 ***
## lat:view4 1.121e+06 1.088e+05 10.304 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 217200 on 21591 degrees of freedom
## Multiple R-squared: 0.6505, Adjusted R-squared: 0.6501
## F-statistic: 1913 on 21 and 21591 DF, p-value: < 2.2e-16
mod5e <- lm(price ~ grade + lat + view + sqft_basement + lat:sqft_basement, data = houses_tidy)
summary(mod5e)
##
## Call:
## lm(formula = price ~ grade + lat + view + sqft_basement + lat:sqft_basement,
## data = houses_tidy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1700954 -101315 -19149 68149 5245821
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.484e+07 6.302e+05 -39.414 < 2e-16 ***
## grade3 1.668e+05 2.509e+05 0.665 0.506163
## grade4 7.638e+04 2.210e+05 0.346 0.729639
## grade5 1.057e+05 2.177e+05 0.486 0.627219
## grade6 1.366e+05 2.173e+05 0.628 0.529733
## grade7 2.071e+05 2.173e+05 0.953 0.340637
## grade8 3.233e+05 2.173e+05 1.488 0.136800
## grade9 5.323e+05 2.173e+05 2.449 0.014328 *
## grade10 7.889e+05 2.174e+05 3.629 0.000285 ***
## grade11 1.159e+06 2.176e+05 5.325 1.02e-07 ***
## grade12 1.739e+06 2.185e+05 7.959 1.81e-15 ***
## grade13 3.132e+06 2.256e+05 13.884 < 2e-16 ***
## lat 5.256e+05 1.245e+04 42.228 < 2e-16 ***
## view1 1.554e+05 1.213e+04 12.812 < 2e-16 ***
## view2 1.095e+05 7.325e+03 14.952 < 2e-16 ***
## view3 1.907e+05 1.002e+04 19.027 < 2e-16 ***
## view4 5.325e+05 1.266e+04 42.068 < 2e-16 ***
## sqft_basement -1.633e+04 1.298e+03 -12.578 < 2e-16 ***
## lat:sqft_basement 3.458e+02 2.729e+01 12.671 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 217300 on 21594 degrees of freedom
## Multiple R-squared: 0.65, Adjusted R-squared: 0.6497
## F-statistic: 2228 on 18 and 21594 DF, p-value: < 2.2e-16
mod5f <- lm(price ~ grade + lat + view + sqft_basement + view:sqft_basement, data = houses_tidy)
summary(mod5f)
##
## Call:
## lm(formula = price ~ grade + lat + view + sqft_basement + view:sqft_basement,
## data = houses_tidy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1765104 -103878 -19859 69921 5256041
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.863e+07 5.591e+05 -51.208 < 2e-16 ***
## grade3 1.825e+05 2.515e+05 0.725 0.468204
## grade4 8.250e+04 2.216e+05 0.372 0.709627
## grade5 1.097e+05 2.183e+05 0.503 0.615290
## grade6 1.383e+05 2.179e+05 0.635 0.525565
## grade7 2.090e+05 2.179e+05 0.959 0.337456
## grade8 3.250e+05 2.179e+05 1.492 0.135731
## grade9 5.329e+05 2.179e+05 2.446 0.014465 *
## grade10 7.884e+05 2.179e+05 3.617 0.000298 ***
## grade11 1.155e+06 2.181e+05 5.296 1.20e-07 ***
## grade12 1.723e+06 2.191e+05 7.862 3.94e-15 ***
## grade13 3.135e+06 2.262e+05 13.863 < 2e-16 ***
## lat 6.054e+05 1.083e+04 55.877 < 2e-16 ***
## view1 1.159e+05 1.886e+04 6.147 8.03e-10 ***
## view2 9.032e+04 1.026e+04 8.802 < 2e-16 ***
## view3 1.482e+05 1.498e+04 9.895 < 2e-16 ***
## view4 4.715e+05 1.819e+04 25.924 < 2e-16 ***
## sqft_basement 1.080e+02 3.960e+00 27.268 < 2e-16 ***
## view1:sqft_basement 7.193e+01 2.310e+01 3.113 0.001853 **
## view2:sqft_basement 3.896e+01 1.330e+01 2.928 0.003411 **
## view3:sqft_basement 6.562e+01 1.571e+01 4.176 2.98e-05 ***
## view4:sqft_basement 8.729e+01 1.652e+01 5.284 1.28e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 217800 on 21591 degrees of freedom
## Multiple R-squared: 0.6483, Adjusted R-squared: 0.6479
## F-statistic: 1895 on 21 and 21591 DF, p-value: < 2.2e-16
# mod5c looks like the best
par(mfrow = c(2,2))
plot(mod5c)
## Warning: not plotting observations with leverage one:
## 8620, 19453
It seems that the grade:sqft_basement interaction leads
to highest \(r^2\) (but two levels of
the interaction cannot be determined due to fitting problems).
Now let’s see a visualisation of the effect of this interaction.
houses_resid %>%
ggplot(aes(x = sqft_basement, y = resid, colour = grade)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE) +
facet_wrap(~ grade)
## `geom_smooth()` using formula 'y ~ x'
Relative importance of predictors:
library(relaimpo)
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## Loading required package: boot
## Loading required package: survey
## Loading required package: grid
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:boot':
##
## aml
##
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
##
## dotchart
## Loading required package: mitools
## This is the global version of package relaimpo.
## If you are a non-US user, a version with the interesting additional metric pmvd is available
## from Ulrike Groempings web site at prof.beuth-hochschule.de/groemping.
calc.relimp(mod4a, method = "lmg", rela = TRUE)
## Response variable: price
## Total response variance: 134782378397
## Analysis based on 21613 observations
##
## 17 Regressors:
## Some regressors combined in groups:
## Group grade : grade3 grade4 grade5 grade6 grade7 grade8 grade9 grade10 grade11 grade12 grade13
## Group view : view1 view2 view3 view4
##
## Relative importance of 4 (groups of) regressors assessed:
## grade view lat sqft_basement
##
## Proportion of variance explained by model: 64.74%
## Metrics are normalized to sum to 100% (rela=TRUE).
##
## Relative importance metrics:
##
## lmg
## grade 0.6613505
## view 0.1475548
## lat 0.1082255
## sqft_basement 0.0828692
##
## Average coefficients for different model sizes:
##
## 1group 2groups 3groups 4groups
## grade3 63666.6672 105085.5034 143973.4826 182193.0397
## grade4 72381.0351 73237.4679 75552.2416 78956.0945
## grade5 106523.9716 105536.1415 105761.2601 107192.1633
## grade6 159919.6380 149061.7309 141055.2803 135919.3905
## grade7 260590.2629 236127.3021 217651.9965 205378.9824
## grade8 400852.7662 366830.4551 340559.3902 321796.9099
## grade9 631513.1864 589253.7503 555932.0020 530680.8349
## grade10 929771.0746 870361.6477 823031.9018 786434.2433
## grade11 1354841.7274 1272451.5902 1207113.0171 1156627.9552
## grade12 2049222.0006 1921648.1390 1819293.0741 1739316.4875
## grade13 3567615.3852 3394076.8577 3257787.9369 3156660.3522
## lat 813411.5832 720920.2912 657688.6172 603904.7514
## view1 315716.6452 250081.3021 198343.3721 156626.5028
## view2 295836.6957 222400.5323 160854.6835 107885.2600
## view3 475401.0824 364804.3448 271176.6966 189797.3285
## view4 967147.0461 802059.7321 660137.1873 534858.0263
## sqft_basement 268.6136 203.6304 154.3004 120.1816
It looks like the grade of property is the most
important determiner of price, followed by how good the
view of the property was.